ABSTRACT

This report delineates the theory and terminology of data
compression. It surveys four data compression methods (null
suppression, pattern substitution, statistical encoding, and
telemetry compression) and relates them to a standard statistical
coding problem, i.e., the noiseless coding problem. The well defined
solution to that problem can serve as a standard on which to base the
effectiveness of data compression methods. The simple measure
described for calculating the effectiveness of a data compression
method is based on the characterization of the solution to the
noiseless coding problem. Finally, guidelines are stated concerning
the relevance of data compression to data processing applications.

Data Compression - A Comparison of Methods
Jules P. Aronson

One important factor in system design and in
the design of software is the cost of storing
data. Methods that reduce storage space can, besides
reducing storage cost, be a critical factor
in whether or not a specific application can be
implemented. This paper surveys data compression
methods and relates them to a standard statistical
coding problem, the noiseless coding problem. The
well defined solution to that problem can serve as
a standard on which to base the effectiveness of
data compression methods. A simple measure, based
on the characterization of the solution to the
noiseless coding problem, is stated through which
the effectiveness of a data compression method can
be calculated. Finally, guidelines are stated concerning
the relevance of data compression to data
processing applications.

Key words: Coding; Coding Theory; Computer
Storage; Data Compaction; Data Compression; Data
Elements; Data Management; Data Processing;
Information Management; Information Theory



1. Introduction

The purpose of this report is to assist Federal Agencies
in developing data element standards that are both compatible
within the Federal government and economical. Specifically,
this report responds to the GAO recommendations that the
Department of Commerce "... issue policy, delineate accepted
theory and terminology, and provide for preparation of
guidelines, methodology, and criteria to be followed by
agencies in their standards efforts"*. This report delineates
the theory and terminology of data compression and surveys
classes of data compression techniques.

* GAO report, "Emphasis Needed On Government's Efforts To
Standardize Data Elements And Codes For Computer Systems",
May 16, 1974, p. 33.



Data element standards activities in the past have been
concerned with abbreviations or codes for specific terms,
such as the names for countries, metropolitan areas, and
states. The purpose of such representations has been to
reduce the space necessary to store such terms, while
maintaining the ability to reproduce the terms from the
representations. While each representation in a given class
is unique, interclass uniqueness is not necessarily
maintained. For example, the standard abbreviation for
CALIFORNIA is CA (1), but the abbreviation for CANADA is
also CA (2). The use of standard codes creates similar
problems. The code for the geographical area of Alameda
County, California is 06001 (3), while that for the standard
metropolitan statistical area of Augusta, Georgia is
0600 (4). To distinguish between these two codes, whenever
they occur in the same file, is complicated and sometimes
impossible, since these codes violate a coding principle
that one code not be a prefix of another (5). The decoding
of the above two codes involves the inefficient process of
backtracking through the message stream after it has been
received.

The reduction in storage effected by the use of data
representations is not as great as the reduction that can
be accomplished by the use of uniform and systematic
techniques of data compression. This report describes methods
which uniformly compress the data, rather than a select set
of terms. These methods may be used to replace standard
representations or may be applied to data in which some
terms are already so represented. These methods could
reduce the high cost of computer operations by eliminating
unnecessary incompatibilities in the representation of data
and by reducing the cost of storing the data.

The cost of storing data is a very significant part of
the total computer system cost. This cost is composed of the
direct charges for the storage media, such as disk devices,
as well as the costs of transferring the data to and from
local and remote storage devices. The latter costs are in
turn composed of the costs of the data channels and, for
remotely stored data, the network, both of which must have
sufficient bandwidth to transmit the data. Data compression
results in cost savings by reducing the amount of
storage required to store data files. In addition, data
compression methods may enable more efficient information
retrieval operations as well as more economical transmission
of large amounts of data over computer networks. There are
several types of data compression techniques, which range
from the suppression of null characters to pattern
substitution and statistical coding.

(1) Nat. Bur. Stand., Fed. Info. Process. Stand. Publ.
(FIPS PUB) 5-1.
(2) FIPS PUB 10-1.
(3) FIPS PUB 6-2.
(4) FIPS PUB 8-4.
(5) See section 3.1.1.

In this report several types of data compression techniques
are discussed along with descriptions of some of
their implementations. The data compression problem is
analyzed with respect to a classification of compression
schemes in terms of the functional attributes of domain,
range, and operation. In addition, concepts from information
theory are introduced, in part 3, to give the reader a
perspective from which to clarify and measure the performance
of compression techniques. From information theory
the compression problem may be seen as an aspect of the more
general noiseless coding problem. The mathematical portion
of part 3 may be skipped without seriously affecting the
meaning of this report. Finally, some criteria for the
selection of techniques are discussed with regard to the
form and application of the data structure.



2. Survey of Data Compression Techniques

2.1 Null Suppression

Null suppression techniques encompass those methods
which suppress zeros, blanks, or both. This type of compression
could be called the de facto standard method for
compressing data files. It takes advantage of the prevalence
of blanks and zeros in some data files, and is easy
to implement. Null suppression may not, however, achieve
as high a compression ratio as some
other techniques. Its obvious application is to card image
data records, which formed the basic data structure of many
of the earlier data management systems.

One way of implementing null suppression is through the
use of a bit map in which a one indicates a non-null data
item and a zero indicates a null item. This method is
applicable to data having fixed size units, such as words
or bytes. Figure 1 illustrates the method, where a bit map
is appended to the front of a collection of items. Units
containing all nulls are dropped from the collection and the
bit which corresponds to such units is set to zero.

    Original Data:    Data 1 | 0 | ... | Data 2 | ... | Data 3 | Data 4
    Compressed Data:  10000100000110 | Data 1 | Data 2 | Data 3 | Data 4

    Figure 1 Zero Suppression Using a Bit Map
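
The bit map method can be sketched in a few lines. The following is a minimal illustration, not an implementation taken from the report: the function names are hypothetical, and the layout of the fourteen units is an assumption chosen so that the bit map equals the one shown in Figure 1.

```python
def bitmap_compress(units, null=0):
    """Return (bit map, non-null units); one bit per original unit."""
    bitmap = [0 if u == null else 1 for u in units]
    kept = [u for u in units if u != null]
    return bitmap, kept


def bitmap_expand(bitmap, kept, null=0):
    """Rebuild the original units by walking the bit map."""
    remaining = iter(kept)
    return [next(remaining) if bit else null for bit in bitmap]


# Assumed layout: four data units scattered among ten null units.
units = ["D1", 0, 0, 0, 0, "D2", 0, 0, 0, 0, 0, "D3", "D4", 0]
bitmap, kept = bitmap_compress(units)
print("".join(map(str, bitmap)), kept)
```

The saving depends on the unit size: each unit costs one bit of map, so the method pays off only when null units are frequent relative to that overhead.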
Another way to implement null suppression is the run
length technique shown in figure 2. A special character is
inserted to indicate a run of nulls. Following that character
is a number to indicate the length of the run. The
choice of the special character depends on the code used to
represent the data. For codes such as ASCII or EBCDIC a
good choice is one of the characters which does not occur in
the data, of which there are many in these codes. If the
character set contains no unused characters, such as in the
six-bit codes, the technique may still be used by selecting
an infrequently used character and doubling it when it occurs
as part of the data.
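
A short sketch of the run length technique follows. It assumes, as in Figure 2, that `#` marks a run of zeros and `%` a run of blanks, that neither marker occurs in the data, and that the run length is a single digit; the minimum run length of 3 and the function names are choices made for this illustration only.

```python
import re

MARKERS = {"0": "#", " ": "%"}  # assumed: '#' and '%' never occur in the data


def rle_compress(text, min_run=3, max_run=9):
    """Replace runs of zeros or blanks by a marker and a one-digit length."""
    out = []
    for match in re.finditer(r"0+| +|[^0 ]+", text):
        chunk = match.group()
        ch = chunk[0]
        if ch not in MARKERS:
            out.append(chunk)          # ordinary data is copied unchanged
            continue
        while chunk:                   # split runs longer than one digit allows
            piece, chunk = chunk[:max_run], chunk[max_run:]
            if len(piece) >= min_run:
                out.append(f"{MARKERS[ch]}{len(piece)}")
            else:
                out.append(piece)      # short runs are cheaper left alone
    return "".join(out)


def rle_expand(text):
    """Invert rle_compress."""
    inverse = {marker: ch for ch, marker in MARKERS.items()}
    out, i = [], 0
    while i < len(text):
        if text[i] in inverse:
            out.append(inverse[text[i]] * int(text[i + 1]))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


record = "Item A10000X02500000N     COST"
print(rle_compress(record))
```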

    Original Data:    Item A10000X02500000NƀƀƀƀƀCOST
    Compressed Data:  Item A1#4X025#5N%5COST
    (ƀ = blank)

    Figure 2 Run Length Coding



2.2 Pattern Substitution

The run length technique is a primitive form of a class
of techniques known as pattern substitution, in which codes
are substituted for specific character patterns. Data files
often contain repeating patterns, such as illustrated in
figure 3. These may include numeric and alphabetic information
combined with or in addition to null characters.

    Original Data:

        AE10004MFQ00000F320006BCX4
        AE20000DBF00000F300000BCX1
        AE30002RBA00000F301214BCX7

    Pattern Table:

        AE       =  #
        000      =  $
        00000F3  =  %
        BCX      =  @

    Compressed Data:

        #1$4MFQ%2$6@4
        #2$0DBF%$00@1
        #3$2RBA%01214@7

    Figure 3 Pattern Substitution
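
The substitution of Figure 3 can be reproduced with a small table-driven routine. This is a sketch under stated assumptions: the pattern table is the one in the figure, the longest pattern is tried first at each position (a choice made here, not stated in the report), and the code characters are assumed never to occur in the data. The function names are hypothetical.

```python
PATTERNS = {"AE": "#", "000": "$", "00000F3": "%", "BCX": "@"}


def substitute(record, patterns=PATTERNS):
    """Scan left to right, replacing the longest matching pattern by its code."""
    ordered = sorted(patterns, key=len, reverse=True)
    out, i = [], 0
    while i < len(record):
        for pattern in ordered:
            if record.startswith(pattern, i):
                out.append(patterns[pattern])
                i += len(pattern)
                break
        else:                      # no pattern starts here: copy the character
            out.append(record[i])
            i += 1
    return "".join(out)


def restore(coded, patterns=PATTERNS):
    """Expand each code character back into its pattern."""
    inverse = {code: pattern for pattern, code in patterns.items()}
    return "".join(inverse.get(ch, ch) for ch in coded)


records = [
    "AE10004MFQ00000F320006BCX4",
    "AE20000DBF00000F300000BCX1",
    "AE30002RBA00000F301214BCX7",
]
for record in records:
    print(substitute(record))
```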



The pattern table may be constructed either in advance or
during the compression of the data. The table may be
transmitted with the data or stored as a permanent part of
the compressor and decompressor. In the method of De Maine,
Kloss, and Marron the pattern is stored with the data, while
in the method of Snyderman and Hunt*, the pattern is stored
in the compressor and decompressor. As in null suppression,
the code for the pattern is represented by unused characters
from the character set.

* See reference 23

The statistical properties of the patterns may be
advantageously used to increase the efficiency of the
compression. In the method of Snyderman and Hunt, even though
trial and error was used to select the patterns, the resultant
patterns were 168 of some of the most frequently occurring
pairs of characters in their textual data files. The frequency
of pairs of characters is further exploited by Jewell,
who chose 190 of the most frequently occurring pairs as
candidates for substitution.

The compression method of Snyderman and Hunt and that
of Jewell involve substituting single character codes for
specific pairs of characters. They differ primarily in the
way the pairs of characters are selected, and secondarily
in the selection of the substitution code.

In the method of Snyderman and Hunt two lists of characters
are selected based partly on their frequency of occurrence
in English text. The first list, called the "master
characters", is a subset of the second list, called the
"combining characters". In the example given by the authors
there are eight master characters (blank, A, E, I, O, N, T, U) and
21 combining characters (blank, A, B, C, D, E, F, G, H, I, L, M, N, O, P,
R, S, T, U, V, W).

The first step of the compaction process involves
translating each character to a hexadecimal code between 00
and 41, leaving 190 contiguous codes at the end, 42 through
FF, for the substitution codes. Next, each translated character
is tested, in turn, to determine if it is a master
character. If it is not, then it is output as it is;
otherwise, it is used as a possible first character of a
pair. When a master character has been found, the next
character in the input string is tested to determine if it
is a combining character. If it is, then the code for the
pair is calculated and replaces both of the input characters.
If the next character is not a combining character
then the translated hexadecimal representations for both are
each moved to the output stream. Figure 4 contains a table
of the compacted code using this scheme.
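
The compaction step can be sketched as follows. Several things here are assumptions made for illustration rather than details confirmed by the text: the base values 58, 6D, ..., EB are read from Figure 4 (with the less legible ones inferred from their constant spacing of 21 codes), the pair code is taken to be the master character's base value plus the combining character's hex code, and the input alphabet is restricted to blank and upper case letters. The names `compact` and `expand` are hypothetical.

```python
MASTERS = " AEIONTU"                      # in order of base value
COMBINING = " ABCDEFGHILMNOPRSTUVW"       # hex codes 00 through 14
NONCOMBINING = "JKQXYZ"                   # hex codes 15 through 1A
MASTER_BASE = {m: 0x58 + len(COMBINING) * i for i, m in enumerate(MASTERS)}
SINGLE = {ch: code for code, ch in enumerate(COMBINING + NONCOMBINING)}


def compact(text):
    """Replace each (master, combining) pair by one byte; copy the rest."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else None
        if ch in MASTER_BASE and nxt is not None and nxt in COMBINING:
            out.append(MASTER_BASE[ch] + COMBINING.index(nxt))
            i += 2
        else:
            out.append(SINGLE[ch])
            i += 1
    return bytes(out)


def expand(data):
    """Invert compact: codes 58 and above are pairs, lower codes are singles."""
    singles = COMBINING + NONCOMBINING
    out = []
    for code in data:
        if code >= 0x58:
            master, offset = divmod(code - 0x58, len(COMBINING))
            out.append(MASTERS[master] + COMBINING[offset])
        else:
            out.append(singles[code])
    return "".join(out)


print(compact("THE CAT").hex())
```

Because every master character is also a combining character, a character that follows a master without combining can never start a pair itself, so emitting it immediately (as the text describes) and testing it on the next pass (as this loop does) give the same output.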
                        COMPACTED CODE

    Master Characters        Combining Characters
    Char   Base Value        Char   Hex Code
     ƀ        58              ƀ       00
     A        6D              A       01
     E        82              B       02
     I        97              C       03
     O        AC              D       04
     N        C1              E       05
     T        D6              F       06
     U        EB              G       07
                              H       08
                              I       09
                              L       0A
                              M       0B
                              N       0C
                              O       0D
                              P       0E
                              R       0F
                              S       10
                              T       11
                              U       12
                              V       13
                              W       14

    Noncombining Characters (Char, Hex Code)

     J 15   K 16   Q 17   X 18   Y 19   Z 1A
     a 1B   b 1C   c 1D   d 1E   e 1F   f 20   g 21   h 22   i 23
     j 24   k 25   l 26   m 27   n 28   o 29   p 2A   q 2B   r 2C
     s 2D   t 2E   u 2F   v 30   w 31   x 32   y 33   z 34
     0 35   1 36   2 37   3 38   4 39   5 3A   6 3B   7 3C   8 3D   9 3E
     3F through 57: punctuation and special characters
     (not legible in the source)

    Combined Pairs (Char, Hex Code)

     ƀƀ 58   ƀA 59   ƀB 5A   ...   ƀW 6C
     Aƀ 6D   AA 6E   AB 6F   ...   AW 81
     Eƀ 82   EA 83           ...   EW 96
     Iƀ 97                   ...
     Oƀ AC                   ...
     Nƀ C1                   ...
     Tƀ D6                   ...
     Uƀ EB                   ...   UW FF

    (in the above ƀ = blank; each pair code equals the base value of
    the master character plus the hex code of the combining character)

    Figure 4

Using the technique described, the Science Information
Exchange compacted the text portion of a 200,000 record
on-line file from an average of 851 to 553 characters per
record, a decrease of 35 percent. Using an IBM 360/40 the
compression takes [value illegible in source] ms for 1000
characters while expansion takes only 65 ms. The extent to
which the decrease was due to null suppression cannot be
determined from the authors' report. Such a determination
would be necessary before an accurate comparison between
methods can be made.

The method of Jewell takes into account the full 190
most frequently occurring character pairs in his sample,
thus taking advantage of the availability of the 190 unused
codes in an 8-bit representation. Figure 5, compiled by
Jewell, is a 2-character frequency distribution of the 25
most frequently occurring pairs in a sample of text. The
190 pairs are entered into a table which forms a
semi-permanent part of the compaction process. The first step of
the process involves shifting the first two characters of
the input stream into a register. If this pair occurs in
the combination table then a code is substituted for the
pair. The code is the address of the pair in the table.
Two new characters are then entered and the process resumes
as in the beginning. If the input pair is not in the table
then the first character of that pair is translated to a
value greater than hexadecimal BD (decimal 189, the last
address of the 190-entry table) and sent to the output stream.
One new character is shifted in with the remaining second
character and the process resumes.
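
The register-and-table process just described can be sketched as below. The alphabet, the way single characters are mapped above hexadecimal BD, and the function names are assumptions for this illustration; the text does not give Jewell's actual translation table.

```python
from collections import Counter

TABLE_SIZE = 190                           # addresses 00 through BD
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"    # hypothetical input alphabet


def build_table(sample, size=TABLE_SIZE):
    """Return the most frequent 2-character combinations of a text sample."""
    counts = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    return [pair for pair, _ in counts.most_common(size)]


def compact(text, table):
    """Substitute table addresses for pairs; translate other characters."""
    address = {pair: i for i, pair in enumerate(table)}
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in address:
            out.append(address[pair])            # code is the table address
            i += 2
        else:
            # single character: a value greater than hexadecimal BD
            out.append(TABLE_SIZE + ALPHABET.index(text[i]))
            i += 1
    return bytes(out)


def expand(data, table):
    """Invert compact."""
    return "".join(
        table[b] if b < TABLE_SIZE else ALPHABET[b - TABLE_SIZE] for b in data
    )


table = ["TH", "E ", "AT"]                 # tiny stand-in for the 190 pairs
print(list(compact("THE CAT", table)))
```

In practice the table would come from `build_table` applied to a representative sample, which is the kind of count Figure 5 reports.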



    Rank   Combination   Occurrences   Occurrences per Thousand
      1        ?             328              26.89
      2        ?             292              23.94
      3        ?             249              20.41
      4        ƀA            244              20.00
      5        ?             217              17.79
      6        ?             200              16.40
      7        ?             197              16.15
      8        ?             183              15.00
      9        ?             171              14.02
     10        ?             156              12.79
     11        ƀO            153              12.54
     12        ?             152              12.46
     13        ES            148              12.13
     14        ?             141              11.56
     15        ON            140              11.48
     16        ?             137              11.23
     17        TI            137              11.23
     18        ?             133              10.90
     19        ?             133              10.90
     20        ?             119               9.76
     21        TE            114               9.35
     22        ƀC            113               9.26
     23        ƀS            113               9.26
     24        OR            112               9.18
     25        ?             109               8.94

    (ƀ = blank; ? = combination not legible in the source)

    Partial results of a 2-character frequency test
    The text size is 12198 characters

    Figure 5
2.3 Statistical Encoding

Statistical encoding is another class of data compression
methods which may be used by itself or combined with a
pattern substitution technique. Statistical encoding takes
advantage of the frequency distribution of characters so
that short representations are used for characters that occur
frequently, and longer representations are used for
characters that occur less frequently. When combined with
pattern substitution, short representations may be used for
some frequently occurring pairs or other groups of characters.
Morse code, for example, uses short code groups for
the common letters, and longer code groups for the others.

When binary ones and zeros are used to represent a message
in variable length codes, there must be a way to tell
where one character or pattern ends and the other begins.
This can be done if the code has the prefix property, which
means that no short code group is duplicated as the beginning
of a longer group. Huffman codes have the prefix quality
and in addition are minimum redundancy codes, that is,
they are optimal in the sense that data encoded in these
codes could not be expressed in fewer bits.

Figure 6 shows the combinatorial technique used to
form Huffman codes. The characters, listed in descending
order of frequency of occurrence, are assigned a sequence of
bits to form codes as follows. The two groups with the smallest
frequencies are selected, and a zero bit is assigned to
one and a one bit is assigned to the other. These values
will ultimately be the value of the rightmost bit of the
Huffman code. In this case, the rightmost bit of A is 1,
while that of B is 0, but the values of the bit assignments
could have been interchanged. Next, the two groups, A and
B, are treated as if they were but one group,
represented by BA, which will be assigned a specific value in
the second bit position. In this way both A and B receive
the same assignment in the second bit position. The above
process is now repeated on the list E, T, 4, BA, where BA
represents groups A and B, and has a frequency of 10%. The
two least frequently occurring groups, represented by 4 and
BA, are selected, and a zero bit is assigned to character 4
and a one bit is assigned to BA. These values will be the
values of the second bit from the right of the Huffman code.
The partial code assembled up to this point is represented
in the step 2 column of Figure 6. In each of steps 3 and 4
the process is repeated, each time forming a new list by
identifying the two elements of the previous list which had
just been assigned values, and then assigning a zero and a one
bit to the two least frequently occurring elements of the
new list. In this example, messages written in the Huffman
codes require only 1.7 bits per character on the average,
whereas three bits would be required in the fixed length
representations. The synthesis of Huffman codes will be
discussed in greater detail in the next section.

Figure 6 Formation of Huffman Code
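
The merging procedure described above can be sketched with a priority queue. The frequencies below are assumptions, since the body of Figure 6 is not available here: they were chosen only to agree with the two numbers the text does give (a combined frequency of 10% for B and A, and an average of 1.7 bits per character). The function name is hypothetical.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a Huffman code by repeatedly merging the two rarest groups."""
    tiebreak = count()                  # keeps heap entries comparable on ties
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, group0 = heapq.heappop(heap)
        f1, _, group1 = heapq.heappop(heap)
        # Prepending means the first assignment ends up as the rightmost bit.
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


# Assumed percentages for the five characters of the example.
freqs = {"E": 60, "T": 20, "4": 10, "B": 5, "A": 5}
codes = huffman_code(freqs)
average = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes, average)
```

With five characters a fixed length code needs three bits per character, which is the comparison the text makes.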



2.4 Telemetry Compression

Telemetry compression techniques are not applicable to
most data files. In telemetry, a sensing device records
measurements at regular intervals. The measurements are
then transmitted to a central location for further
processing. Compression is applied prior to transmission to
reduce the total amount of data to be transmitted. Telemetry
[text illegible in source] a predetermined value. Otherwise, the field is
represented [text illegible in source] by some escape character to indicate
that the particular field is not coded. The conditions that
make the incremental coding technique effective
[text illegible in source]
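
The legible parts of this description (a predetermined value, an escape character marking fields that are not coded, and the name "incremental coding") suggest a scheme of the following kind. This is a sketch under that assumption and not a reconstruction of the report's text: each measurement is sent as its difference from the previous one when the difference is small, and is otherwise sent in full behind an escape marker. The limit of 7, the `ESC` marker, and the function names are all hypothetical.

```python
ESCAPE = "ESC"   # hypothetical escape marker for fields sent uncoded


def delta_encode(samples, limit=7):
    """Send small changes as increments and everything else in full."""
    out, previous = [], None
    for value in samples:
        if previous is not None and abs(value - previous) <= limit:
            out.append(value - previous)        # increment within the limit
        else:
            out.append((ESCAPE, value))         # first value or large jump
        previous = value
    return out


def delta_decode(coded):
    """Rebuild the measurements from increments and escaped full values."""
    out, previous = [], None
    for item in coded:
        previous = item[1] if isinstance(item, tuple) else previous + item
        out.append(previous)
    return out


readings = [100, 101, 103, 102, 150, 151]
print(delta_encode(readings))
```

The gain comes from the increments needing fewer bits than a full field, so the sketch is useful only for slowly varying measurements.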



3. Analysis of Data Compression

Data compression may be represented as the application
of some function to elements of the data base. If we let x
be a specified element of the data base, then the compression
of x is y = f(x).

Here, x, the element of the data base, may be a string
of one or more bits, bytes, characters, pairs or n-tuples of
characters, words, or text fragments. f is a function that
maps the element x into some other element y. The domain of
a function is that set upon which the function operates,
while the range is that set whose elements are the results
of the function operation. The different compression
techniques may be characterized by the choice of the domain,
range and the operation of the function f.

Usually, f is invertible, which means that the original
data may be recovered from the compressed data. However, in
some applications a non-invertible choice of f may be
advantageous. For example, when the data base to be compressed
consists of record identification keys, only an abbreviated
form of each key may be necessary to retrieve each record.
In that case a non-invertible compression technique that
removes some of the information from each key would generate a
more compressed key file than one that was invertible.

In the method of Snyderman and Hunt the domain was
the collection of pairs of characters, the range of f was
the collection of bytes, and f was invertible. The
definitions of the domain and range for the other methods are
summarized in table I.

It appears that compression techniques may be classified
in terms of the type of domain, range and operation.
Of the methods surveyed, the domain was composed of either
fixed length or variable length elements. The range, except
for those techniques that generate Huffman codes, was
composed of fixed length elements. To generate Huffman codes,
the function maps the domain into elements whose length
decreases as the frequency of occurrence of the
element in the domain increases.

In some cases the methods differ only in the function
definition. The difference between the method of Snyderman
and Hunt and the one for Huffman code with patterns is that
in the first case the function maps characters and pairs
into bytes, while in the latter case the function maps these
same elements into variable length fields.



Table I

Domain and Range of a Sample of Data Compression Methods
(column entries listed in source order)

Method:  Snyderman & Hunt; Schieber & Thomas; Jewell; Lynch; Hahn;
         Ling & Palermo; Schuegraf & Heaps; Huffman Code with patterns

Domain:  pairs of characters; characters; fixed length fields;
         text fragments; pairs of characters

Range:   bytes; bytes; fixed length fields; three fields (two are
         fixed length, the other is multiple words); fixed length
         fields; variable length binary strings



The performance of these methods, chosen somewhat arbitrarily
to represent a cross sample of the data compression
methods in the literature, differs both in terms of percent
reduction and computation time. As one may suspect, the more
complex methods, such as the Huffman code generators, require
more computation time than the simpler methods like
that of Snyderman and Hunt. The Huffman code method did obtain
a greater percent reduction than the others, so the increased
computation time may be worthwhile for some applications.
On the other hand, the text fragment method of
Schuegraf and Heaps required a significantly longer computation
time to accomplish roughly the same degree of compression as
the simpler digraph methods. Table II contains a summary of
published performance of some data compression methods.
Notice that the measure of performance in the table is the
percent reduction of storage space. Later in the paper, that measure
will be shown to be unreliable when compared to the measure
of entropy of the data.
Table II

Published Results of Some Compression Techniques

(Table body not legible in the source. Columns include Method and
% Reduction; the legible rows are Snyderman & Hunt, Schieber & Thomas,
and Lynch, the last with the Institute of Elect. Eng. INSPEC system
and the British National Bibliography as data bases.)
While the compression methods described in the
Schuegraf and Heaps paper have limited utility because, as
noted above, their complexity does not increase their
effectiveness over the simpler digraph methods, the discussion
of variable length text fragments in that paper leads
to a related question about the structure of the data base:
what form should the dictionary take? Inverted-file
retrieval systems using free text data bases commonly identify
words as keys or index terms about which the file is inverted,
and through which access is provided. The words of natural
language exhibit a Zipfian* rank-frequency relationship
in which a small number of words account for a large
proportion of word occurrences, while a large number of
words occur infrequently. The inverted-file system involves
large and growing dictionaries and thus may entail inefficient
utilization of storage because of these distribution
characteristics. It may be advantageous to consider the formation
of keys for file-inversion from units other than words. In
particular, if variable length text fragments are chosen as
keys, then the compression method may be a powerful
tool for inverted-file systems. A paper
by Clare, Cook, and Lynch [4] discusses the subject
of variable length text fragments in greater detail.

3.1 Noiseless Coding Problem

Most of the compression methods described in the
literature are approximations to the solution of the noiseless
coding problem, which is described as follows. A random
variable $X$ takes on values $x_1, \ldots, x_m$ with probabilities
$p_1, \ldots, p_m$ respectively. Code words $w_1, \ldots, w_m$, of
lengths $n_1, \ldots, n_m$ respectively, are assigned to the symbols
$x_1, \ldots, x_m$. The code words are combinations of characters
taken from a code alphabet $a_1, \ldots, a_D$ of length $D$. The
problem is to construct a uniquely decipherable code which
minimizes the average code-word length
$\bar{n} = \sum_{i=1}^{m} p_i n_i$. Such
codes will be called optimal in this paper. Usually the alphabet
consists of the symbols 0 and 1. The problem may be
approached in three steps. First we establish a lower bound
on $\bar{n}$; then we find out how close we can come to that lower
bound; then we synthesize the best code. We shall indicate
to what degree the various compression methods are attempts
to synthesize the best code.

* Zipfian distribution: a hyperbolic distribution in
which the probability of occurrence of a word is
inversely proportional to the rank of the word. If $r$ is
the rank of a word, then the probability $p$ is defined
by $p(r) = A/r$, where $A$ is a constant chosen so that the
sum $\sum_{r=1}^{N} p(r) = 1$.
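
To make the footnote concrete, the constant $A$ is the reciprocal of the sum $1 + 1/2 + \cdots + 1/N$. The sketch below computes the distribution for a hypothetical vocabulary of 10,000 words (a size chosen only for illustration) and the share of occurrences taken by the 100 highest ranked words.

```python
def zipf_probabilities(n_words):
    """Return p(r) = A / r for r = 1..n_words, with A chosen so they sum to 1."""
    a = 1.0 / sum(1.0 / r for r in range(1, n_words + 1))
    return [a / r for r in range(1, n_words + 1)]


probs = zipf_probabilities(10_000)       # hypothetical vocabulary size
top_share = sum(probs[:100])             # share of the 100 most frequent words
print(round(sum(probs), 6), round(top_share, 2))
```

Under these assumptions one percent of the words account for roughly half of all occurrences, which is the skew that makes word-keyed dictionaries inefficient.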
3.1.1 Uniquely Decipherable Codes. What is a uniquely
decipherable code? For example, consider the following binary
code:

    $x_1$    0
    $x_2$    010
    $x_3$    01
    $x_4$    10

The binary sequence 010 could correspond to any one of the
three messages $x_2$, $x_3 x_1$, or $x_1 x_4$. Since the sequence 010
cannot be decoded accurately, the following definition is
needed to establish a rule to avoid such sequences.

A code is uniquely decipherable if every finite sequence
of code characters corresponds to at most one message.
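
The ambiguity of the example code can be checked by enumerating every way a sequence splits into code words. The helper below is a hypothetical illustration (exhaustive search, suitable only for short sequences), applied to the four code words given above.

```python
def parses(code, sequence):
    """Return every message (list of symbols) whose encoding is the sequence."""
    if not sequence:
        return [[]]
    found = []
    for symbol, word in code.items():
        if sequence.startswith(word):
            for rest in parses(code, sequence[len(word):]):
                found.append([symbol] + rest)
    return found


code = {"x1": "0", "x2": "010", "x3": "01", "x4": "10"}
for message in parses(code, "010"):
    print(message)
```

A code is uniquely decipherable exactly when this enumeration never returns more than one message for any sequence.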

One way to insure unique decipherability is to require
that no code word be a prefix of another code word. If A
and C are finite sequences of code characters then the
juxtaposition of A and C, written AC, is the sequence formed by
writing A followed by C. The sequence A is a prefix of the
sequence B if B may be written as AC for some sequence C.

Codes which have the above property, namely that no
code word is a prefix of another code word, are called
instantaneous codes. The code below is an example of an
instantaneous code.

    $x_1$    0
    $x_2$    100
    $x_3$    101
    $x_4$    11

Notice that the sequences 11111, 10101, or 1001 do not
correspond to any message; so such sequences should never
appear and can be disregarded. The commonly used ASCII and
EBCDIC codes are also instantaneous, but they are such because
of their fixed length, since all fixed length codes
are instantaneous. Every instantaneous code is uniquely
decipherable, but not conversely. To see this, for a given
finite sequence of code characters of an instantaneous code,
proceed from left to right until a code word W is formed. If
no such word can be formed, then the unique decipherability
condition is vacuously satisfied. Since W is not the prefix
of any code word, W must be the first symbol of the message.
Continuing until another code word is formed, and so on,
this process may be repeated until the end of the message.
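
The left to right argument is itself a decoding procedure. The sketch below, with hypothetical function names, checks the prefix condition and decodes the instantaneous code of the example; a non-empty remainder means the sequence does not correspond to any message. It assumes the code words are distinct.

```python
CODE = {"x1": "0", "x2": "100", "x3": "101", "x4": "11"}


def is_prefix_free(code):
    """True if no code word is a prefix of a different code word."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def decode(code, bits):
    """Emit a symbol as soon as a code word is completed, left to right."""
    lookup = {word: symbol for symbol, word in code.items()}
    message, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            message.append(lookup[current])
            current = ""
    return message, current      # leftover bits mean an incomplete message


print(is_prefix_free(CODE), decode(CODE, "010011"))
```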

The word "instantaneous" refers to the fact that the code
may be deciphered step by step. If, when proceeding left to
right, W is the first word formed, we know immediately that
W is the first word of the message. In a uniquely decipherable
code which is not instantaneous, the decoding process
may have to continue for a long time before the identity of
the first word is known. For example, if in the code

    $x_1$    0
    $x_2$    00...01    (n characters)

we received the sequence of n+1 characters 00...001 we
would have to wait until the end of the sequence to find out
that the first symbol is $x_1$. Fortunately, the solution to
the noiseless coding problem can be realized with an
instantaneous code. Notice that while the ASCII and EBCDIC codes
are instantaneous, they are usually far from optimal.

3.1.2 Optimal Codes. The degree of optimality of a code is
measured by the entropy of the message or text. The entropy
H(X) is defined as

        $$H(X) = -\sum_{i=1}^{M} p_i \log_2 p_i$$

where $p_1, \ldots, p_M$ are the probabilities of the message
symbols.

The following theorem gives the lower bound to the
average length of the code.




(Noiseless Coding Theorem). If $\bar{n} = \sum_{i=1}^{M} p_i n_i$
is the average code word length of a uniquely decipherable code
for the random variable X, then $\bar{n} \ge H(X)/\log_2 D$, with
equality if and only if $p_i = D^{-n_i}$ for i = 1, ..., M. Note
that $H(X)/\log_2 D$ is the uncertainty of X using logarithms to
the base D, that is,

        $$\frac{H(X)}{\log_2 D} = -\sum_{i=1}^{M} p_i \log_D p_i .$$

For the environment we are interested in, the coding
alphabet is binary, so D = 2. Thus the lower bound is simply
$\bar{n} \ge H(X)$. H(X) is not only the lower bound to the length
of the code needed to represent the data, it also provides a
measure of the improvement that may be expected by compress-
ing the data. The comparison of the value of H(X) to the
current average code size, which is 8 for ASCII or EBCDIC,
gives a measure of the improvement that can be realized by
compressing the data. If H(X)=8 then no compression is real-
izable by coding the data differently; if H(X)=5 then up to
an 8 to 5 compression ratio may be obtained. The comparison
of the improvement realized by a specific data compression
technique to the theoretic improvement given by the above
ratio can serve to evaluate the effectiveness of the tech-
nique. The measure of effectiveness usually given, the file
length before and after compression, does not indicate the
true level of compression, since the compression may have
been due mainly to null suppression.
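
As a rough illustration of how H(X) bounds the achievable compression, the following Python sketch computes the entropy of a probability list and the best ratio relative to a fixed-length code. The function names and the default of 8 bits per character are assumptions made for the example, not part of the report.

```python
import math


def entropy_bits(probabilities):
    """H(X) = -sum p_i log2 p_i, in bits per symbol (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def best_compression_ratio(entropy, current_bits=8.0):
    """Upper limit on the ratio (current size : compressed size) for a
    statistical code, e.g. 8 to 5 when H(X) = 5 and characters use 8 bits."""
    return current_bits / entropy
```

For a file whose 256 byte values are equally likely, `entropy_bits` gives 8 and the ratio is 1, so no statistical compression is possible; for H(X) = 5 the ratio is 1.6, the 8 to 5 figure quoted in the text.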

Any code that achieves the lower bound of the noiseless
coding theorem is called absolutely optimal. The following
code is an example of an absolutely optimal code.

        X        Probabilities    Code Words
        x_1      1/2              0
        x_2      1/4              10
        x_3      1/8              110
        x_4      1/8              111

        $$H(X) = \bar{n} = 7/4$$
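
As a check on the four-symbol example, the entropy and the average code-word length can be worked out directly from the probabilities 1/2, 1/4, 1/8, 1/8, assuming code-word lengths 1, 2, 3, 3 (the code words 0, 10, 110, 111):

```latex
\begin{aligned}
H(X) &= \tfrac12\log_2 2 + \tfrac14\log_2 4 + \tfrac18\log_2 8 + \tfrac18\log_2 8
      = \tfrac12 + \tfrac12 + \tfrac38 + \tfrac38 = \tfrac74, \\
\bar{n} &= \tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = \tfrac74 .
\end{aligned}
```

Equality holds because every probability is an exact power of 1/2, which is the condition $p_i = D^{-n_i}$ with D = 2.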



In a previous example of a Huffman code, figure 6, the
average code length of the Huffman code was 1.7 bits per
character, while the value of the entropy H(X) was 1.156
bits per character. That example illustrates the general
impossibility of constructing an absolutely optimal code for
arbitrary collections of characters. That example also il-
lustrates that any coding method will be bound by the value
of H(X).



3.2 Realization of Optimal Codes



While the theorem states the existence of an absolutely
optimal code, in general the construction of one for an ar-
bitrary set of probabilities is impossible. For a given set
of probabilities $p_1, p_2, \ldots, p_M$, if the code is to be
absolutely optimal, the lengths of the code words must be
chosen to satisfy $p_i = D^{-n_i}$, which is the same as

        $$n_i = \frac{-\log_2 p_i}{\log_2 D} .$$

Obviously the $n_i$ defined in this way need not be integers.
However, we may do the next best thing by choosing the
integer $n_i$ to satisfy the inequalities:

        $$\frac{-\log_2 p_i}{\log_2 D} \le n_i < \frac{-\log_2 p_i}{\log_2 D} + 1 .$$

An instantaneous code can be shown to exist in which the
code-word lengths satisfy the above inequality. The following
theorem characterizes such codes.

Given a random variable X with uncertainty H(X), there
exists a base D instantaneous code for X whose average
code-word length $\bar{n}$ satisfies

        $$\frac{H(X)}{\log_2 D} \le \bar{n} < \frac{H(X)}{\log_2 D} + 1 .$$

For a proof see Ash, page 39.
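
The choice of integer lengths described above can be sketched as follows. This is a minimal illustration, not the construction given in Ash: the helper names are hypothetical, and each length is found by a simple search for the smallest n with $D^{-n} \le p_i$, which avoids rounding trouble when $p_i$ is an exact power of 1/D.

```python
import math


def shannon_lengths(probabilities, base=2):
    """Smallest integer n_i with base**(-n_i) <= p_i, for each p_i > 0."""
    lengths = []
    for p in probabilities:
        if p <= 0:
            raise ValueError("probabilities must be positive")
        n = 0
        while p * base ** n < 1:
            n += 1
        lengths.append(n)
    return lengths


def kraft_sum(lengths, base=2):
    """Sum of base**(-n_i); a value <= 1 means an instantaneous code
    with these word lengths exists."""
    return sum(base ** -n for n in lengths)


def average_length(probabilities, lengths):
    return sum(p * n for p, n in zip(probabilities, lengths))


def entropy(probabilities, base=2):
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)
```

For the probabilities .3, .3, .2, .15, .05 this gives lengths 2, 2, 3, 3, 5 and an average of about 2.5 binary digits, which lies between H(X) (about 2.13) and H(X) + 1 as the theorem requires. A Huffman code for the same probabilities does better, since these lengths are only guaranteed to be within one digit of the bound.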

* • ' 

This theorem says that the average code-word length may
be made sufficiently small to be within one digit of the
lower bound set by the noiseless coding theorem. That lower
bound may be approached arbitrarily closely if block coding is
used (for example, coding of blocks of length 2). Block coding
works as follows: instead of assigning a code word to each
symbol $x_i$, we assign a code word to each group of s symbols.
In other words, we construct a code for the random vector
$Y = (X_1, X_2, \ldots, X_s)$, where the $X_i$ are independent and
each $X_i$ has the same distribution as X. If each $X_i$ assumes M
values then Y assumes $M^s$ values. The following example
illustrates the decrease in the average code-word length by
block coding.

        X      P      Code Word      Y = (X_1, X_2)    P       Code Word
        x_1    3/4    0              x_1 x_1           9/16    0
        x_2    1/4    1              x_1 x_2           3/16    10
                                     x_2 x_1           3/16    110
                                     x_2 x_2           1/16    111

For X alone:

        $\bar{n} = 1$ code character/value of X

For the blocks Y:

        $\bar{n} = 9/16 + (3/16)(2) + (1/4)(3)$
                 = 27/16 code characters/2 values of X
                 = 27/32 code characters/value of X
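
The arithmetic of the block-coding example can be reproduced with exact fractions. The sketch below assumes the two-symbol source with probabilities 3/4 and 1/4 and the pair code 0, 10, 110, 111 from the table; the variable and function names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Source distribution and the two codes from the example (assumed as read).
P = {"x1": Fraction(3, 4), "x2": Fraction(1, 4)}
SINGLE_CODE = {"x1": "0", "x2": "1"}
PAIR_CODE = {
    ("x1", "x1"): "0",
    ("x1", "x2"): "10",
    ("x2", "x1"): "110",
    ("x2", "x2"): "111",
}


def average_single():
    """Average code characters per value of X when coding symbol by symbol."""
    return sum(P[s] * len(SINGLE_CODE[s]) for s in P)


def average_pairs_per_symbol():
    """Average code characters per value of X when coding independent pairs."""
    per_block = sum(
        P[a] * P[b] * len(PAIR_CODE[(a, b)]) for a, b in product(P, repeat=2)
    )
    return per_block / 2
```

Here `average_single()` is 1 and `average_pairs_per_symbol()` is 27/32, about 0.84, which is already close to the entropy of this source (about 0.81 bits per value of X).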

y ' ' . : ♦ * _ 

By the above theorem, the average code-word length $\bar{n}_s$
for the block of length s satisfies

        $$\frac{H(Y)}{\log_2 D} \le \bar{n}_s < \frac{H(Y)}{\log_2 D} + 1$$  code characters/value of Y.

$H(Y) = H(X_1, \ldots, X_s) \le H(X_1) + \cdots + H(X_s)$ whether or
not the $X_i$ are independent of each other. If they are
independent, then the inequality becomes an equality. If the
$X_i$ are identically distributed, then
$H(X_1) + \cdots + H(X_s) = sH(X)$. In the classical case, both
independence and identical distribution are assumed, in which
case the average code word length satisfies

        $$\frac{sH(X)}{\log_2 D} \le \bar{n}_s < \frac{sH(X)}{\log_2 D} + 1$$

or

        $$\frac{H(X)}{\log_2 D} \le \frac{\bar{n}_s}{s} < \frac{H(X)}{\log_2 D} + \frac{1}{s} .$$

While for text files and messages the independence of each
$X_i$ is a tenuous assumption, the assumption that each $X_i$ is
identically distributed is credible. Upon dropping the in-
dependence assumption, the above inequality becomes

        $$\frac{H(X_1, \ldots, X_s)}{s \log_2 D} \le \frac{\bar{n}_s}{s} < \frac{H(X)}{\log_2 D} + \frac{1}{s} .$$

Thus we see that regardless of the independence of the ele-
ments of the block, the upper bound of the average code
length may be made as close to $H(X)/\log_2 D$ as desired by
increasing the block length. On the other hand, the lower
limit may be smaller when the elements of the block are not
independent, as is frequently the case in text files. Thus
for the conditions applicable to text files and messages the
average code length may be made at least as small as
the optimal length characterized by the noiseless coding
theorem. The dependence of characters in text files may ex-
plain why the simple digraph methods are so successful. That
dependence is further exploited in the method of Wagner
which substitutes codes for entire English phrases.



3.3 Synthesis of the Huffman Code



So far only the existence of optimal codes has been
discussed; now the synthesis of one such code, the Huffman
code, will be illustrated. For the synthesis of optimal
codes, only the instantaneous codes need to be considered
since if a code is optimal with respect to the class of in-
stantaneous codes, then it is also optimal with respect to
all uniquely decipherable codes. This characteristic is
indeed fortunate since instantaneous codes are the codes of
choice for data transmission and processing applications.
The precise statement of this characteristic is as follows.

If a code C is optimal within the class of instantane-
ous codes for the given probabilities $p_1, p_2, \ldots, p_M$, which
means that no other instantaneous code for the same given
set of probabilities has a smaller average code-word length
than C, then C is optimal within the entire class of unique-
ly decipherable codes.

For a proof see Ash, page 40.

. ' ' ■ ' • ' , ■ • ' . 

An optimal binary code can be characterized by certain
necessary conditions which restrict the choices of code
lengths that may be assigned to each code. These characteri-
zations are as follows.

Given a binary code C with word lengths $n_1, n_2, \ldots, n_M$
associated with a set of symbols with probabilities
$p_1, p_2, \ldots, p_M$, assume, for convenience, that the symbols are
arranged in order of decreasing probability
($p_1 \ge p_2 \ge \cdots \ge p_M$) and that a group of symbols with the
same probability is arranged in order of increasing code-
word length. (If $p_i = p_{i+1} = \cdots = p_{i+r}$, then
$n_i \le n_{i+1} \le \cdots \le n_{i+r}$.) Then if C is optimal within the
class of instantaneous codes, C must have the following proper-
ties:

a. Higher probability symbols have shorter code words,
that is, $p_j > p_k$ implies $n_j \le n_k$.

b. The two least probable symbols have code words of
equal length, that is, $n_{M-1} = n_M$.

c. Among the code words of length $n_M$ there must be at
least two words that agree in all digits except the last.
For example, the following code cannot be optimal since code
words 4 and 5 do not agree in the first three places.

        x_1        0
        x_2        100
        x_3        101
        x_4        1101
        x_5        1110

For a proof see Ash, page 41.
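
Properties b and c depend only on the code words, so they are easy to test mechanically. The following sketch uses hypothetical helper names, and it assumes the symbols are ordered as in the text, so that the two least probable symbols carry the two longest code words. The fifth code word of the counterexample (1110) is also an assumption, since that entry is not legible in this copy.

```python
def two_longest_have_equal_length(codewords):
    """Property b, under the ordering convention: n_(M-1) == n_M."""
    lengths = sorted(len(w) for w in codewords)
    return len(lengths) >= 2 and lengths[-1] == lengths[-2]


def has_sibling_pair(codewords):
    """Property c: among the longest code words, at least two agree in
    every digit except the last."""
    longest = max(len(w) for w in codewords)
    stems = [w[:-1] for w in codewords if len(w) == longest]
    return len(stems) != len(set(stems))


# Counterexample from the text (x_5 = 1110 is assumed).
NOT_OPTIMAL = ["0", "100", "101", "1101", "1110"]
```

`has_sibling_pair(NOT_OPTIMAL)` is false because 1101 and 1110 differ before the last digit. Replacing 1110 by 1100 would satisfy property c; the word 1101 could then be shortened further, which is the idea behind the proof.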

The construction of a Huffman code for the characters
$c_1, \ldots, c_n$ with probabilities $p_1, \ldots, p_n$ respectively,
involves generating a binary tree [1] for which each of the
above characters is represented as a terminal node and the
other nodes, the internal nodes, are formed in the following
manner. First, from the two nodes with smallest probabili-
ties, say $c_1$ and $c_2$, a new node $c_{1,2}$ with probability
$p_1 + p_2$ is formed to be the father of $c_1$ and $c_2$. Now with
the reduced set of n-1 nodes, which consists of
$c_{1,2}, c_3, \ldots, c_n$ with probabilities
$p_1 + p_2, p_3, \ldots, p_n$ respectively, repeat the above
procedure; and continue to repeat it until the reduced set
consists of only two nodes. Now consider the binary tree
which consists of the terminal nodes and all the new nodes
formed by the above process. For each successive pair of
branches, starting at the root, assign the values 0 and 1 to
each branch. The resultant code for each of the characters
is the sequence of assigned values obtained by tracing the
tree from the root to each of the terminal nodes. Each
aggregate causes the items so chosen to have a code length
of one more binary digit; so the average length is minimized
by giving this extra digit to the least probable clump. The
following example illustrates the method.

[1] A binary tree is a graph which consists of a root
node and descendant nodes. From the root node are
links to at most two other nodes, the descendants of
the root node. Each of these descendants, in turn, is
linked to no more than two other nodes; and these
latter nodes may be similarly linked to other nodes,
and so on.

Let the characters be $c_1, c_2, c_3, c_4, c_5$ and have
probabilities .3, .3, .2, .15, .05, respectively. In the
tree constructed for this example, the terminal nodes
are represented by squares, the other nodes by circles, and
in each square and circle is the probability of the node.

[Figure: Huffman tree for the five characters, with the
probability of each node shown inside its square or circle.]

The Huffman code for each of the characters is:

        Character    Code
        c_1          00
        c_2          01
        c_3          10
        c_4          110
        c_5          111
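
The merging procedure described in this section can be sketched with a priority queue. This is an illustrative implementation under assumptions, not the report's own program: the function name `huffman_code` is hypothetical, exact fractions are used to avoid rounding, and ties are broken by insertion order, so the individual code words may differ from the table while the word lengths agree.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_code(probabilities):
    """Build a binary Huffman code from a mapping symbol -> probability."""
    order = count()  # tie-breaker so the heap never compares dictionaries
    heap = [(p, next(order), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in probabilities}
    while len(heap) > 1:
        # Take the two least probable nodes and make them siblings.
        p_low, _, low = heapq.heappop(heap)
        p_high, _, high = heapq.heappop(heap)
        merged = {sym: "1" + word for sym, word in low.items()}
        merged.update({sym: "0" + word for sym, word in high.items()})
        heapq.heappush(heap, (p_low + p_high, next(order), merged))
    return heap[0][2]


# Probabilities of the example: .3, .3, .2, .15, .05.
EXAMPLE = {
    "c1": Fraction(3, 10),
    "c2": Fraction(3, 10),
    "c3": Fraction(1, 5),
    "c4": Fraction(3, 20),
    "c5": Fraction(1, 20),
}
```

For `EXAMPLE` the resulting word lengths are 2, 2, 2, 3, 3, the same as in the table, giving an average of 2.2 binary digits per character.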



A variation of the Huffman code, a variable length al-
phabetic code, is described in a paper by Hu and Tucker.
In their method a tree, which is optimal in another sense,
is obtained which preserves the original order of the
terminal nodes. Using this algorithm, alphabetical codes may
be generated, which enables ordering operations to be
applied to the coded text in the same way as the uncoded
text.

Observe that for the formation of the Huffman code the
distribution of the characters or blocks must be known in
advance. It may appear that the Huffman code is valid only
for each instance or version of the data so that a new code
may have to be generated for each data base and for each
change to the data base. Fortunately, the distribution of
characters is not that sensitive to changes in the data. One
study has shown that the distribution of characters for a
particular data base is stable over a period of time. [18]
Moreover the same distribution seems to be relatively stable
across different English text data bases. The following
graph shows the distribution of characters in a typical En-
glish text.



[Figure: bar graphs of the frequencies of INDIVIDUAL LETTERS
(in decreasing order: E T O A N I R S H D L C F U M P Y W G B
V K X J Q Z), DIGRAPHS, and TRIGRAPHS.]

Normal frequency distribution of the letters of the alphabet
(in uses per thousand)



The following table, from the paper by Lynch, Petrie,
and Snell [18], shows a distribution of characters which is
close to that in the graph.

Table 1. Normalized frequencies with means and standard
deviations for the first 29 characters (arranged in ranked
order). The issues analysed range from INSPEC 31002 (1969)
to INSPEC 31060 (1972).
[The per-issue frequency columns are illegible in the source;
only the standard deviation and mean columns are reproduced.]

        Row    Character      s.d.           Mean
         1     (illegible)    0.0054         0.1483
         2     (illegible)    0.0039         0.0875
         3     T              (illegible)    0.0725
         4     I              0.000856       0.0731
         5     O              (illegible)    0.0705
         6     N              0.000527       0.0673
         7     A              0.000712       0.0651
         8     R              0.000949       0.0570
         9     S              0.000608       0.0534
        10     (illegible)    0.000545       0.0398
        11     L              0.000368       0.0373
        12     M              0.000870       0.0265
        13     F              0.000540       0.0257
        14     D              0.000380       0.0256
        15     U              0.000331       0.0236
        16     H              0.000529       0.0219
        17     P              0.00053        0.0218
        18     G              0.000554       0.0156
        19     Y              0.000310       0.0123
        20     B              0.000257       0.0089
        21     V              0.000285       0.0073
        22     (illegible)    0.000137       0.0063
        23     W              0.000303       0.0054
        24     X              0.000199       0.0027
        25     K              0.000067       0.0022
        26     (illegible)    0.000169       0.0020
        27     (illegible)    0.000246       0.0016
        28     Q              0.000095       0.0018
        29     Z              0.000084       0.0015

For a given Huffman code, changes in the average code
word length with respect to changes in the distribution of
the characters may be analyzed in the following way. Let the
code word lengths be $n_1, n_2, \ldots, n_m$, where
$n_1 \le n_2 \le \cdots \le n_m$, and the probabilities of the
characters be $p_1, \ldots, p_m$. Suppose that the i'th probability
changes by the amount $d_i$, so that $p_i' = p_i + d_i$ is the new
i'th probability. The new average code word length is

        $$\bar{n}' = \sum_{i=1}^{m} p_i' n_i = \sum_{i=1}^{m} (p_i + d_i) n_i = \bar{n} + \sum_{i=1}^{m} d_i n_i .$$

Let $D = \sum_{i=1}^{m} d_i n_i$. Then since $\sum_{i=1}^{m} d_i = 0$,

        $$D = \sum_{i=1}^{m-1} d_i (n_i - n_m) .$$

There are two interesting cases to consider. The first occurs
when $d_i > 0$ for i = 1, 2, ..., m-1. Then, since
$n_i - n_m \le 0$, $D \le 0$ so $\bar{n}' \le \bar{n}$. The second case
occurs when $d_i < 0$ for i = 1, 2, ..., m-1. Then
$\bar{n}' \ge \bar{n}$. If the changes $d_i$ are restricted so that

        $|d_i| \le$ [bound illegible in the source]

then

        $$D = \sum_{i=1}^{m-1} (-d_i)(n_m - n_i) \le \cdots \le 1 - (1/2)^{m-1} < 1 .$$

It appears that as long as the distribution of characters
changes only slightly from data base to data base, a Huffman
code designed for one of the data bases will be adequate for
the others. Further study of the variation of Huffman codes
with respect to changes in the data base is needed before
more detailed statements can be made about the performance
of Huffman codes when such changes occur.
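
The sensitivity argument can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the word lengths 2, 2, 2, 3, 3 with probabilities .3, .3, .2, .15, .05 are taken from the Huffman example of section 3.3 as an assumed test case.

```python
from fractions import Fraction


def average_length(probabilities, lengths):
    """Average code word length, sum of p_i * n_i."""
    return sum(p * n for p, n in zip(probabilities, lengths))


def length_change(deltas, lengths):
    """D = sum of d_i * n_i; the deltas must sum to zero so that the
    perturbed values are still a probability distribution."""
    if sum(deltas) != 0:
        raise ValueError("changes in probability must sum to zero")
    return sum(d * n for d, n in zip(deltas, lengths))


def length_change_relative(deltas, lengths):
    """Equivalent form: sum over i < m of d_i * (n_i - n_m), where n_m
    is the last (longest) word length."""
    n_m = lengths[-1]
    return sum(d * (n - n_m) for d, n in zip(deltas[:-1], lengths[:-1]))


LENGTHS = [2, 2, 2, 3, 3]
PROBS = [Fraction(30, 100), Fraction(30, 100), Fraction(20, 100),
         Fraction(15, 100), Fraction(5, 100)]
# First case of the text: every d_i is positive except the last.
DELTAS = [Fraction(1, 100)] * 4 + [Fraction(-4, 100)]
```

With these values both forms give D = -3/100, so shifting probability toward the shorter code words lowers the average length from 2.2 to 2.17.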
CONCLUSIONS



Several types of compression methods have been dis-
cussed along with the underlying coding theory and the meas-
ures for evaluating the effectiveness of a compression
method. It was shown that the data compression problem is
the same as the optimal coding problem when the data file is
considered as a collection of independent characters. Since
data characters are generally not independent, the optimal
code may be even shorter than that predicted by the noise-
less coding theorem, thus possibly permitting even greater
compression. A good measure of the effectiveness of the
method is not the percent reduction, used in some of the
referenced papers, but the ratio of the entropy H(X) of the
data file to the average encoded character size in bits. If
the compression is at least as good as the optimal code then
the ratio is greater than or equal to 1, otherwise it is
less than one.



The steps to be followed in selecting or determining the
need for a data compression method involve the calculation
of the entropy of the data. These steps are:

1. Measure H(X), where

        $$H(X) = -\sum_{i=1}^{N} p_i \log_2(p_i) .$$

In the above formula for H(X), $p_i = f_i/F$, where $f_i$ is the
frequency of the i'th type of element of the data file,
F is the total number of elements in the file
($F = \sum_{i=1}^{N} f_i$), and N is the number of distinct types
of elements. As in section 3.1, the data file is composed of
a sequence of elements which are usually characters. In ASCII
data files, there are 128 different types of characters that
may occur in the file; however, since control characters
usually do not occur in a file, most data files will have
only 96 possible types of characters. Alternatively H can be
calculated from the equivalent expression

        $$H(X) = \log_2(F) - \frac{1}{F} \sum_{i=1}^{N} f_i \log_2(f_i)$$

by summing the values $f_i \log_2(f_i)$ for each character,
dividing by F and then subtracting the result from
$\log_2(F)$. For large data files, it is not necessary to base
the calculations on the entire file, but only on part of the
file, say the first 100,000 bytes if the file is homogeneous,
or one can use some random sampling procedure to estimate the
frequencies $f_i$.

2. Determine the current average character length $\bar{n}$ in
bits. For ASCII and EBCDIC files this value will usually be
8. If H(X) is much less than $\bar{n}$ then a statistical compres-
sion method will be effective. If, on the other hand, H(X)
is close to $\bar{n}$ then such methods will not be effective; how-
ever, some type of pattern substitution may be applicable.
For example, if H(X)=7 and the current code-word length is 8
then some improvement would be expected by compressing the
data, but a greater improvement is to be expected when
H(X)=5 and the current length is 8.

3. If the data is numerical, then a numerical method
such as polynomial predictors and polynomial curve fitting
algorithms [5-9] may be superior to the methods discussed in
this report.

4. If the data is text or a combination of text and
numerical tables, and the data is compressible as indicated
in step 2, then either a digraph method or a Huffman method
would compress the data. The digraph method is much easier
to implement, and runs faster than the Huffman method, while
the latter obtains a higher degree of compression. The
choice of the compression method will depend on the charac-
teristics and applications of the data. Data files which
contain mostly numeric fields would be compressible by an
entirely different algorithm than would text files. Fre-
quently accessed files may need an algorithm which is
quicker than that for less frequently accessed files, even
though the data compression obtained by the faster algorithm
is far less than optimal. Within the same file system, parts
of the file may be more efficiently compressed with dif-
ferent methods. The dictionary* of an information management
system may be compressed with a simple yet fast algorithm,
while the corresponding data files, because they are infre-
quently accessed, may be compressed with a more complex al-
gorithm, which runs slower but realizes more compression. A
variable length alphabetic code**, which has some of the op-
timal properties of the Huffman code, may be used to
compress the dictionary.

* The dictionary, as used here, refers to the collection
of pointers of an inverted file system. Each pointer,
by pointing to a record of the file, functions in a
manner analogous to a word of an English language
dictionary.

5. The effectiveness of a particular data compression
method can be measured by comparing the average character
length of the data file after it has been compressed to the
value of the entropy of the file. If the average character
length, after compression, is close to the value of the en-
tropy then the method is as effective as an optimal statist-
ical compression method. If the value of the average is
still significantly greater than the value of the entropy,
then the compression method is not as effective as pos-
sible.
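
Steps 1, 2 and 5 above can be sketched as follows. The function names, the 100,000-byte default sample, and the use of bytes as the data elements are assumptions made for this illustration; the second entropy function uses the frequency form of the formula, written as log2(F) minus the averaged sum of f log2(f).

```python
import math
from collections import Counter


def entropy_from_counts(counts):
    """H(X) = -sum (f_i/F) log2(f_i/F) over the observed element types."""
    total = sum(counts.values())
    return -sum((f / total) * math.log2(f / total) for f in counts.values())


def entropy_from_counts_alt(counts):
    """Equivalent form: log2(F) - (1/F) * sum f_i log2(f_i)."""
    total = sum(counts.values())
    return math.log2(total) - sum(f * math.log2(f) for f in counts.values()) / total


def sample_entropy(data, sample_size=100_000):
    """Estimate H(X) from the first sample_size bytes of a homogeneous file."""
    return entropy_from_counts(Counter(data[:sample_size]))


def effectiveness_ratio(entropy_bits, average_encoded_bits):
    """Ratio of H(X) to the average encoded character size in bits;
    a value of 1 or more means the method matches or beats the optimal
    character-by-character code."""
    return entropy_bits / average_encoded_bits
```

For the eight-byte sample `b"aaaabbcd"` both entropy functions return 1.75 bits per character, so an 8-bit encoding has an effectiveness ratio of about 0.22, while an encoding averaging 1.75 bits per character would score 1.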

Data compression is relevant to a data processing ap-
plication when its use is significant or meaningful to the
user. Its use should be considered when it offers any of
the following:

        1. Significant cost reduction

        2. Significant storage reduction

        3. Allowing the implementation of the application
           which otherwise could not have been implemented
           due to insufficient storage

        4. Significant decrease in the data transfer time

What is or is not significant to a user is relative to
his environment. To a mini-computer user with limited
disc storage, a reduction of a few thousand bytes of storage
may be significant, while to a large system user such a
reduction would be insignificant. While the ultimate deci-
sion on whether or not data compression is relevant depends
on the user's special requirements and judgement, the follow-
ing three guidelines will be applicable in most cases.

        1. If the quantity of data is small, say under
           100,000 bytes, or if the life of the data is
           short, then data compression would not be advis-
           able.

        2. Large data files, over 100,000 bytes, the life
           of which is not short, are good candidates for
           data compression.

        3. A group of data files, where the files have
           similar character composition, is a good candidate
           for data compression when the size of the group is
           more than 100,000 bytes.



** See section 3.3.